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Abstract 

Deep Reinforcement Learning has yielded proficient 
controllers for complex tasks. However, these con¬ 
trollers have limited memory and rely on being able 
to perceive the complete game screen at each deci¬ 
sion point. To address these shortcomings, this arti¬ 
cle investigates the effects of adding recurrency to 
a Deep Q-Network (DQN) by replacing the first 
post-convolutional fully-connected layer with a recur¬ 
rent LSTM. The resulting Deep Recurrent Q-Network 
(DRQN), although capable of seeing only a single frame 
at each timestep, successfully integrates information 
through time and replicates DQN’s performance on 
standard Atari games and partially observed equivalents 
featuring flickering game screens. Additionally, when 
trained with partial observations and evaluated with in¬ 
crementally more complete observations, DRQN’s per¬ 
formance scales as a function of observability. Con¬ 
versely, when trained with full observations and eval¬ 
uated with partial observations, DRQN’s performance 
degrades less than DQN’s. Thus, given the same length 
of history, recurrency is a viable alternative to stacking 
a history of frames in the DQN’s input layer and while 
recurrency confers no systematic advantage when learn¬ 
ing to play the game, the recurrent net can better adapt 
at evaluation time if the quality of observations changes. 

Introduction 

Deep Q-Networks (DQNs) have been shown to be capable 
of learning human-level control policies on a variety of dif¬ 
ferent Atari 2600 games (Mnih et al. 2015). True to their 
name, DQNs learn to estimate the Q-Values (or long-term 
discounted returns) of selecting each possible action from 
the current game state. Given that the network’s Q-Value 
estimate is sufficiently accurate, a game may be played 
by selecting the action with the maximal Q-Value at each 
timestep. Learning policies mapping from raw screen pix¬ 
els to actions, these networks have been shown to achieve 
state-of-the-art performance on many Atari 2600 games. 

However, Deep Q-Networks are limited in the sense that 
they learn a mapping from a limited number of past states, 
or game screens in the case of Atari 2600. In practice, DQN 
is trained using an input consisting of the last four states the 
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(a) Pong (b) Frostbite (c) Double Dunk 

Figure 1: Nearly all Atari 2600 games feature moving ob¬ 
jects. Given only one frame of input, Pong, Frostbite, and 
Double Dunk are all POMDPs because a single observation 
does not reveal the velocity of the ball (Pong, Double Dunk) 
or the velocity of the icebergs (Frostbite). 


agent has encountered. Thus DQN will be unable to master 
games that require the player to remember events more dis¬ 
tant than four screens in the past. Put differently, any game 
that requires a memory of more than four frames will appear 
non-Markovian because the future game states (and rewards) 
depend on more than just DQN’s current input. Instead of 
a Markov Decision Process (MDP), the game becomes a 
Partially-Observable Markov Decision Process (POMDP). 

Real-world tasks often feature incomplete and noisy state 
information resulting from partial observability. As Figure 
1 shows, given only a single game screen, many Atari 2600 
games are POMDPs. One example is the game of Pong in 
which the current screen only reveals the location of the pad¬ 
dles and the ball, but not the velocity of the ball. Knowing 
the direction of travel of the ball is a crucial component for 
determining the best paddle location. 

We observe that DQN’s performance declines when given 
incomplete state observations and hypothesize that DQN 
may be modified to better deal with POMDPs by leverag¬ 
ing advances in Recurrent Neural Networks. Therefore we 
introduce the Deep Recurrent Q-Network (DRQN), a com¬ 
bination of a Long Short Term Memory (LSTM) (Hochreiter 
and Schmidhuber 1997) and a Deep Q-Network. Crucially, 
we demonstrate that DRQN is capable of handing partial ob¬ 
servability, and that when trained with full observations and 
evaluated with partial observations, DRQN better handles 








the loss of information than does DQN. Thus, recurrency 
confers benefits as the quality of observations degrades. 

Deep Q-Learning 

Reinforcement Learning (Sutton and Barto 1998) is con¬ 
cerned with learning control policies for agents interacting 
with unknown environments. Such environments are often 
formalized as a Markov Decision Processes (MDPs), de¬ 
scribed by a 4-tuple {S^A^V^IZ). At each timestep t an 
agent interacting with the MDP observes a state St G S, 
and chooses an action at ^ A which determines the reward 
n ^ Tl{st, at) and next state ^ V{st, at). 

Q-Leaming (Watkins and Dayan 1992) is a model-free 
off-policy algorithm for estimating the long-term expected 
return of executing an action from a given state. These esti¬ 
mated returns are known as Q-values. A higher Q-value in¬ 
dicates an action a is judged to yield better long-term results 
in a state 5 . Q-values are learned iteratively by updating the 
current Q-value estimate towards the observed reward plus 
the max Q-value over all actions a' in the resulting state s'\ 

Q{s,a) := Q(5,a) + a(r+7maxQ(5',a')-Q(5,a)) (1) 

Many challenging domains such as Atari games feature 
far too many unique states to maintain a separate estimate 
for each S x A. Instead a model is used to approximate 
the Q-values (Mnih et al. 2015). In the case of Deep Q- 
Learning, the model is a neural network parameterized by 
weights and biases collectively denoted as 0. Q-values are 
estimated online by querying the output nodes of the net¬ 
work after performing a forward pass given a state input. 
Such Q-values are denoted Q{s, a\0). Instead of updating in¬ 
dividual Q-values, updates are now made to the parameters 
of the network to minimize a differentiable loss function: 

L{s, a\0i) = (r -h 7max(5(5', a'\0i) — Q{s, a\0i))^ (2) 

0i^,=0i^a\/eL{0i) (3) 

Since \0\ \S X A\, the neural network model natu¬ 
rally generalizes beyond the states and actions it has been 
trained on. However, because the same network is gener¬ 
ating the next state target Q-values that are used in updat¬ 
ing its current Q-values, such updates can oscillate or di¬ 
verge (Tsitsiklis and Roy 1997). Deep Q-Learning uses three 
techniques to restore learning stability: First, experiences 
et = {st^at^rt^ ^t-\-i) are recorded in a replay memory V 
and then sampled uniformly at training time. Second, a sep¬ 
arate, target network Q provides update targets to the main 
network, decoupling the feedback resulting from the net¬ 
work generating its own targets. Q is identical to the main 
network except its parameters 0~ are updated to match 9 
every 10,000 iterations. Finally, an adaptive learning rate 
method such as RMSProp (Tieleman and Hinton 2012) or 
ADADELTA (Zeiler 2012) maintains a per-parameter learn¬ 
ing rate a, and adjusts a according to the history of gradient 
updates to that parameter. This step serves to compensate for 
the lack of a fixed training dataset; the ever-changing nature 


of V may require certain parameters start changing again 
after having reached a seeming fixed point. 

At each training iteration i, an experience et = 
(st, Ut, rt, 5t+i) is sampled uniformly from the replay mem¬ 
ory V. The loss of the network is determined as follows: 

Li{ 0 i) ^(st,at,rt,st+i)^T> (jji 5 5 (A) 

where i/i = + 7 max^/Q(5t+i, a'; 6>“) is the stale 

update target given by the target network Q. Updates per¬ 
formed in this manner have been empirically shown to be 
tractable and stable (Mnih et al. 2015). 

Partial Observability 

In real world environments it’s rare that the full state of the 
system can be provided to the agent or even determined. In 
other words, the Markov property rarely holds in real world 
environments. A Partially Observable Markov Decision Pro¬ 
cess (POMDP) better captures the dynamics of many real- 
world environments by explicitly acknowledging that the 
sensations received by the agent are only partial glimpses 
of the underlying system state. Formally a POMDP can be 
described as a 6-tuple {S, A,V,7l,ft,0). S,A,V,1Z are 
the states, actions, transitions, and rewards as before, except 
now the agent is no longer privy to the true system state 
and instead receives an observation o G U. This observa¬ 
tion is generated from the underlying system state accord¬ 
ing to the probability distribution o ^ 0{s). Vanilla Deep 
Q-Learning has no explicit mechanisms for deciphering the 
underlying state of the POMDP and is only effective if the 
observations are reflective of underlying system states. In 
the general case, estimating a Q-value from an observation 
can be arbitrarily bad since Q{o,a\0) ^ Q{s^a\0). 

Our experiments show that adding recurrency to Deep Q- 
Learning allows the Q-network network to better estimate 
the underlying system state, narrowing the gap between 
Q{o,a\0) and Q{s,a\0). Stated differently, recurrent deep 
Q-networks can better approximate actual Q-values from se¬ 
quences of observations, leading to better policies in par¬ 
tially observed environments. 

DRQN Architecture 

To isolate the effects of recurrency, we minimally modify 
the architecture of DQN, replacing only its first fully con¬ 
nected layer with a recurrent LSTM layer of the same size. 
Depicted in Figure 2, the architecture of DRQN takes a sin¬ 
gle 84 X 84 preprocessed image. This image is processed 
by three convolutional layers (Cun et al. 1998) and the out¬ 
puts are fed to the fully connected LSTM layer (Hochreiter 
and Schmidhuber 1997). Finally, a linear layer outputs a Q- 
Value for each action. During training, the parameters for 
both the convolutional and recurrent portions of the network 
are learned jointly from scratch. We settled on this archi¬ 
tecture after experimenting with several variations; see Ap¬ 
pendix A for details. 
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Figure 2: DRQN convolves three times over a single-channel 
image of the game screen. The resulting activations are 
processed through time by an LSTM layer. The last two 
timesteps are shown here. LSTM outputs become Q-Values 
after passing through a fully-connected layer. Convolutional 
filters are depicted by rectangular sub-boxes with pointed 
tops. 


Stable Recurrent Updates 

Updating a recurrent, convolutional network requires each 
backward pass to contain many time-steps of game screens 
and target values. Additionally, the LSTM’s initial hidden 
state may either be zeroed or carried forward from its previ¬ 
ous values. We consider two types of updates: 

Bootstrapped Sequential Updates: Episodes are se¬ 
lected randomly from the replay memory and updates be¬ 
gin at the beginning of the episode and proceed forward 
through time to the conclusion of the episode. The targets 
at each timestep are generated from the target Q-network, 
Q. The RNN’s hidden state is carried forward throughout 
the episode. 

Bootstrapped Random Updates: Episodes are selected 
randomly from the replay memory and updates begin at ran¬ 
dom points in the episode and proceed for only unroll itera¬ 
tions timesteps (e.g. one backward call). The targets at each 
timestep are generated from the target Q-network, Q. The 
RNN’s initial state is zeroed at the start of the update. 

Sequential updates have the advantage of carrying the 
LSTM’s hidden state forward from the beginning of the 
episode. However, by sampling experiences sequentially for 
a full episode, they violate DQN’s random sampling policy. 


Random updates better adhere to the policy of randomly 
sampling experience, but, as a consequence, the LSTM’s 
hidden state must be zeroed at the start of each update. Ze¬ 
roing the hidden state makes it harder for the LSTM to learn 
functions that span longer time scales than the number of 
timesteps reached by back propagation through time. 

Experiments indicate that both types of updates are vi¬ 
able and yield convergent policies with similar performance 
across a set of games. Therefore, to limit complexity, all re¬ 
sults herein use the randomized update strategy. We expect 
that all presented results would generalize to the case of se¬ 
quential updates. 

Having addressed the architecture and updating of a Deep 
Recurrent Q-Network, we now show how it performs on do¬ 
mains featuring partial observability. 

Atari Games: MDP or POMDP? 

The state of an Atari 2600 game is fully described by the 
128 bytes of console RAM. Humans and agents, however, 
observe only the console-generated game screens. For many 
games, a single game screen is insufficient to determine the 
state of the system. DQN infers the full state of an Atari 
game by expanding the state representation to encompass 
the last four game screens. Many games that were previ¬ 
ously POMDPs now become MDPs. Of the 49 games in¬ 
vestigated by (Mnih et al. 2015), the authors were unable to 
identify any that were partially observable given the last four 
frames of input. ^ Since the explored games are fully observ¬ 
able given four input frames, we need a way to introduce 
partial observability without reducing the number of input 
frames given to DQN. 

Flickering Atari Games 

To address this problem, we introduce the Flickering Pong 
POMDP - a modification to the classic game of Pong such 
that at each timestep, the screen is either fully revealed or 
fully obscured with probability p = 0.5. Obscuring frames 
in this manner probabilistically induces an incomplete mem¬ 
ory of observations needed for Pong to become a POMDP. 

In order to succeed at the game of Flickering Pong, it is 
necessary to integrate information across frames to estimate 
relevant variables such as the location and velocity of the 
ball and the location of the paddle. Since half of the frames 
are obscured in expectation, a successful player must be ro¬ 
bust to the possibility of several potentially contiguous ob¬ 
scured inputs. 

Perhaps the most important opportunity presented by a 
history of game screens is the ability to convolutionally de¬ 
tect object velocity. Figure 3 visualizes the game screens 
maximizing the activations of different convolutional filters 
and confirms that the lO-frame DQN’s filters do detect ob¬ 
ject velocity, though perhaps less reliably than normal unob¬ 
scured Pong.^ 

^Some Atari games are undoubtedly POMDPs such as Black¬ 
jack in which the dealer’s cards are hidden from view. Unfortu¬ 
nately, Blackjack is not supported by the ALE emulator. 

^(Guo et al. 2014) also confirms that convolutional filters learn 
to respond to patterns of movement seen in game objects. 





(a) Convl Filters 



(b) Conv2 Filters 



(c) Conv3 Filters 



(d) Image sequences maximizing three sample LSTM units 


Figure 3: Sample convolution filters learned by 10-frame 
DQN on the game of Pong. Each row plots the input frames 
that trigger maximal activation of a particular convolutional 
filter in the specified layer. The red bounding box illustrates 
the portion of the input image that caused the maximal ac¬ 
tivation. Most filters in the first convolutional layer detect 
only the paddle. Conv2 filters begin to detect ball movement 
in particular directions and some jointly track the ball and 
the paddle. Nearly all Conv3 filters track ball and paddle in¬ 
teractions including defiections, ball velocity, and direction 
of travel. Despite seeing a single frame at a time, individual 
LSTM units also detect high level events, respectively: the 
agent missing the ball, ball refiections off of paddles, and 
ball refiections off the walls. Each image superimposes the 
last 10-frames seen by the agent, giving more luminance to 
the more recent frames. 


Remarkably, DRQN performs well at this task even when 
given only one input frame per timestep. With a single frame 
it is impossible for DRQN’s convolutional layers to detect 
any type of velocity. Instead, the higher-level recurrent layer 
must compensate for both the flickering game screen and 
the lack of convolutional velocity detection. Eigure 3d con¬ 
firms that individual units in the LSTM layer are capable of 
integrating noisy single-frame information through time to 
detect high-level Pong events such as the player missing the 
ball, the ball refiecting on a paddle, or the ball refiecting off 
the wall. 

DRQN is trained using backpropagation through time for 
the last ten timesteps. Thus both the non-recurrent 10-frame 
DQN and the recurrent 1-frame DRQN have access to the 
same history of game screens.^ Thus, when dealing with 
partial observability, a choice exists between using a non¬ 
recurrent deep network with a long history of observations 
or using a recurrent network trained with a single observa¬ 
tion at each timestep. The results in this section show that 
recurrent networks can integrate information through time 
and serve as a viable alternative to stacking frames in the 
input layer of a convoluational network. 

Evaluation on Standard Atari Games 

We selected the following nine Atari games for evalua¬ 
tion: Asteroids and Double Dunk feature naturally-dickering 
sprites making them good potential candidates for recurrent 
learning. Beam Rider, Centipede, and Chopper Command 
are shooters. Frostbite is a platformer similar to Erogger. Ice 
Hockey and Double Dunk are sports games that require po¬ 
sitioning players, passing and shooting the puck/bail, and 
require the player to be capable of both offense and defense. 
Bowling requires actions to be taken at a specific time in or¬ 
der to guide the ball. Ms Pacman features flickering ghosts 
and power pills. 

Given the last four frames of input, all of these games 
are MDPs rather than POMDPs. Thus there is no reason to 
expect DRQN to outperform DQN. Indeed, results in Ta¬ 
ble 1 indicate that on average, DRQN does roughly as well 
DQN. Specifically, our re-implementation of DQN performs 
similarly to the original, outperforming the original on five 
out of the nine games, but achieving less than half the origi¬ 
nal score on Centipede and Chopper Command. DRQN per¬ 
forms outperforms our DQN on the games of Erostbite and 
Double Dunk, but does significantly worse on the game of 
Beam Rider (Eigure 4). The game of Erostbite (Eigure lb) 
requires the player to jump across all four rows of moving 
icebergs and return to the top of the screen. After traversing 
the icebergs several times, enough ice has been collected to 
build an igloo at the top right of the screen. Subsequently 
the player can enter the igloo to advance to the next level. 
As shown in Eigure 4, after 12,000 episodes DRQN discov¬ 
ers a policy that allows it to reliably advance past the first 
level of Erostbite. Eor experimental details, see Appendix B. 

^However, (Karpathy, Johnson, and Li 2015) show that LSTMs 
can learn functions at training time over a limited set of timesteps 
and then generalize them at test time to longer sequences. 

^Statistical significance of scores determined by independent t- 














DRQN ±std DQN ±std 


Game 


Ours 

Mnih et al. 

Asteroids 

1020 (±312) 

1070 (±345) 

1629 (±542) 

Beam Rider 

3269 (±1167) 

6923 (±1027) 

6846 (±1619) 

Bowling 

62 (±5.9) 

72 (±11) 

42 (±88) 

Centipede 

3534 (±1601) 

3653 (±1903) 

8309 (±5237) 

Chopper Cmd 

2070 (±875) 

1460 (±976) 

6687 (±2916) 

Double Dunk 

-2 (±7.8) 

-10 (±3.5) 

-18.1 (±2.6) 

Frostbite 

2875 (±535) 

519 (±363) 

328.3 (±250.5) 

Ice Hockey 

-4.4 (±1.6) 

-3.5 (±3.5) 

-1.6 (±2.5) 

Ms. Pacman 

2048 (±653) 

2363 (±735) 

2311 (±525) 


Table 1: On standard Atari games, DRQN performance par¬ 
allels DQN, excelling in the games of Frostbite and Double 
Dunk, but struggling on Beam Rider. Bolded font indicates 
statistical significance between DRQN and our DQN.^ 
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Figure 4: Frostbite and Beam Rider represent the best and 
worst games for DRQN. Frostbite performance jumps as the 
agent learns to reliably complete the first level. 
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MDP to POMDP Generalization 

Can a recurrent network be trained on a standard MDP and 
then generalize to a POMDP at evaluation time? To ad¬ 
dress this question, we evaluate the highest-scoring policies 
of DRQN and DQN over the flickering equivalents of all 
9 games in Table 1. Figure 5 shows that while both algo¬ 
rithms incur significant performance decreases on account 
of the missing information, DRQN captures more of its pre¬ 
vious performance than DQN across all levels of flickering. 
We conclude that recurrent controllers have a certain degree 
of robustness against missing information, even trained with 
full state information. 


Observation Probability 


Figure 5: When trained on normal games (MDPs) and then 
evaluated on flickering games (POMDPs), DRQN’s perfor¬ 
mance degrades more gracefully than DQN’s. Each data 
point shows the average percentage of the original game 
score over all 9 games in Table 1. 


Related Work 

Previously, LSTM networks have been demonstrated to 
solve POMDPs when trained using policy gradient meth¬ 
ods (Wierstra et al. 2007). In contrast to policy gradient, 
our work uses temporal-difference updates to bootstrap an 
action-value function. Additionally, by jointly training con¬ 
volutional and LSTM layers we are able to learn directly 
from pixels and do not require hand-engineered features. 

LSTM has been used as an advantage-function approxi¬ 
mator and shown to solve a partially observable corridor and 
cartpole tasks better better than comparable (non-LSTM) 
RNNs (Bakker 2001). While similar in principle, the cor¬ 
ridor and cartpole tasks feature tiny states spaces with just a 
few features. 

In parallel to our work, (Narasimhan, Kulkarni, and Barzi- 
lay 2015) independently combined LSTM with Deep Re¬ 
inforcement Learning to demonstrate that recurrency helps 
to better play text-based fantasy games. The approach is 
similar but the domains differ: despite the apparent com¬ 
plexity of the fantasy-generated text, the underlying MDPs 
feature relatively low-dimensional manifolds of underlying 
state space. The more complex of the two games features 
only 56 underlying states. Atari games, in contrast, feature 
a much richer state space with typical games having mil¬ 
lions of different states. However, the action space of the text 
games is much larger with a branching factor of 222 versus 
Atari’s 18. 


Discussion and Conclusion 


tests using Benjamini-Hochberg procedure and significance level 
P = .05. 


Real-world tasks often feature incomplete and noisy state 
information, resulting from partial observability. We mod¬ 
ify DQN to handle the noisy observations characteristic of 











POMDPs by combining a Long Short Term Memory with a 
Deep Q-Network. The resulting Deep Recurrent Q-Network 
(DRQN), despite seeing only a single frame at each step, is 
still capable integrating information across frames to detect 
relevant information such as velocity of on-screen objects. 
Additionally, on the game of Pong, DRQN is better equipped 
than a standard Deep Q-Network to handle the type of partial 
observability induced by flickering game screens. 

Furthermore, when trained with partial observations, 
DRQN can generalize its policies to the case of complete 
observations. On the Flickering Pong domain, performance 
scales with the observability of the domain, reaching near¬ 
perfect levels when every game screen is observed. This re¬ 
sult indicates that the recurrent network learns policies that 
are both robust enough to handle to missing game screens 
and scalable enough to improve performance as observabil¬ 
ity increases. Generalization also occurs in the opposite di¬ 
rection: when trained on standard Atari games and evaluated 
against flickering games, DRQN’s performance generalizes 
better than DQN’s at all levels of partial information. 

Our experiments suggest that Pong represents an outlier 
among the examined games. Across a set of ten Flicker¬ 
ing MDPs we observe no systematic improvement when 
employing recurrency. Similarly, across non-flickering Atari 
games, there are few signiflcant differences between the re¬ 
current and non-recurrent player. This observation leads us 
to conclude that while recurrency is a viable method for 
handling state observations, it confers no systematic bene¬ 
fit compared to stacking the observations in the input layer 
of a convolutional network. One avenue for future is identi¬ 
fying the relevant characteristics of the Pong and Frostbite 
that lead to better performance by recurrent networks. 
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Appendix A: Alternative Architectures 

Several alternative architectures were evaluated on the game 
of Beam Rider. We explored the possibility of either replac¬ 
ing the first non-convolutional fully connected layer with an 
LSTM layer (LSTM replaces IPl) or adding the LSTM layer 
between the first and second fully connected layers (LSTM 
over IPl). Results strongly indicated LSTM should replace 
IPL We hypothesize this allows LSTM direct access to the 
convolutional features. Additionally, adding a Rectifier layer 
after the LSTM layer consistently reduced performance. 


Description 

Percent Improvement 

LSTM replaces IPl 

709% 

ReLU-LSTM replaces IPl 

533% 

LSTM over IPl 

418% 

ReLU-LSTM over IPl 

0% 


Another possible architecture combines frame stacking 
from DQN with the recurrency of LSTM. This architecture 
accepts a stack of the four latest frames at every timestep. 
The LSTM portion of the architecture remains the same and 
is unrolled over the last 10 timesteps. In theory, this modifi¬ 
cation should allow velocity detection to happen in the con¬ 
volutional layers of the network, leaving the LSTM free to 
perform higher-order processing. This architecture has the 
largest number of parameters and requires the most training 
time. Unfortunately, results show that the additional parame¬ 
ters do not lead to increased performance on the set of games 
examined. It is possible that the network has too many pa¬ 
rameters and is prone to overfitting the training experiences 
it has seen. 

Appendix B: Computational Efficiency 

Computational efficiency of RNNs is an important concern. 
We conducted experiments by performing 1000 backwards 
and forwards passes and reporting the average time in mil¬ 
liseconds required for each pass. Experiments used a single 
Nvidia GTX Titan Black using CuDNN and a fully opti¬ 
mized version of Caffe. Results indicate that computation 
scales sub-linearly in both the number of frames stacked in 
the input layer and the number of iterations unrolled. Even 
so, models trained on a large number of stacked frames and 
unrolled for many iterations are often computationally in¬ 
tractable. Eor example a model unrolled for 30 iterations 
with 10 stacked frames would require over 56 days to reach 
10 million iterations. 

Appendix C: Experimental Details 

Policies were evaluated every 50,000 iterations by playing 
10 episodes and averaging the resulting scores. Networks 
were trained for 10 million iterations and used a replay 
memory of size 400,000. Additionally, all networks used 
ADADELTA (Zeiler 2012) optimizer with a learning rate of 
0.1 and momentum of 0.95. LSTM’s gradients were clipped 
to a value of ten to ensure learning stability. All other set¬ 
tings were identical to those given in (Mnih et al. 2015). 

All networks were trained using the Arcade Learning En¬ 
vironment ALE (Bellemare et al. 2013). The following ALE 


Backwards (ms) 

Eorwards (ms) 

Erames 

1 

4 

10 

1 

4 

10 

Baseline 

8.82 

13.6 

26.7 

2.0 

4.0 

9.0 

Unroll 1 

18.2 

22.3 

33.7 

2.4 

4.4 

9.4 

Unroll 10 

77.3 

111.3 

180.5 

2.5 

4.4 

8.3 

Unroll 30 

204.5 

263.4 

491.1 

2.5 

3.8 

9.4 


Table 2: Average milliseconds per backwards/forwards pass. 
Erames refers to the number of channels in the input im¬ 
age. Baseline is a non recurrent network (e.g. DQN). Unroll 
refers to an LSTM network backpropagated through time 
1/10/30 steps. 


options were used: color averaging, minimal action set, and 
death detection. 

DRQN is implemented in Caffe (Jia et al. 2014). The 
source is available at [removed for blind review]. 

Appendix C: Flickering Results 


Flickering 

DRQN ±std 

DQN ±std 

Asteroids 

1032 (±410) 

1010 (±535) 

Beam Rider 

618 (±115) 

1685.6 (±875) 

Bowling 

65.5 (±13) 

57.3 (±8) 

Centipede 

4319.2 (±4378) 

5268.1 (±2052) 

Chopper Cmd 

1330 (±294) 

1450 (±787.8) 

Double Dunk 

-14 (±2.5) 

-16.2 (±2.6) 

Frostbite 

414 (±494) 

436 (±462.5) 

Ice Hockey 

-5.4 (±2.7) 

-4.2 (±1.5) 

Ms. Pacman 

1739 (±942) 

1824 (±490) 

Pong 

12.1 (±2.2) 

-9.9 (±3.3) 


Table 3: Each screen is obscured with probability 0.5, re¬ 
sulting in a partially-observable, flickering equivalent of the 
standard game. Bold font indicates statistical significance.^ 



